LLM Hallucination
Calibrated Trust in Dealing with LLM Hallucinations: A Qualitative Study
Ryser, Adrian, Allwein, Florian, Schlippe, Tim
Hallucinations are outputs by Large Language Models (LLMs) that are factually incorrect yet appear plausible [1]. This paper investigates how such hallucinations influence users' trust in LLMs and their interactions with them. To explore this in everyday use, we conducted a qualitative study with 192 participants. Our findings show that hallucinations do not result in blanket mistrust but instead lead to context-sensitive trust calibration. Building on the calibrated trust model by Lee & See [2] and Afroogh et al.'s trust-related factors [3], we confirm expectancy [3], [4], prior experience [3], [4], [5], and user expertise & domain knowledge [3], [4] as user-related (human) trust factors, and identify intuition as an additional factor relevant for hallucination detection. Additionally, we found that trust dynamics are further influenced by contextual factors, particularly perceived risk [3] and decision stakes [6]. Consequently, we validate the recursive trust calibration process proposed by Blöbaum [7] and extend it by including intuition as a user-related trust factor. Based on these insights, we propose practical recommendations for responsible and reflective LLM use.
- Europe (0.68)
- North America > United States > Massachusetts (0.28)
Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries
Hagar, Nick, Agustianto, Wilma, Diakopoulos, Nicholas
Large language models (LLMs) are increasingly used in newsroom workflows, but their tendency to hallucinate poses risks to core journalistic practices of sourcing, attribution, and accuracy. We evaluate three widely used tools - ChatGPT, Gemini, and NotebookLM - on a reporting-style task grounded in a 300-document corpus related to TikTok litigation and policy in the U.S. We vary prompt specificity and context size and annotate sentence-level outputs using a taxonomy to measure hallucination type and severity. Across our sample, 30% of model outputs contained at least one hallucination, with rates approximately three times higher for Gemini and ChatGPT (40%) than for NotebookLM (13%). Qualitatively, most errors did not involve invented entities or numbers; instead, we observed interpretive overconfidence - models added unsupported characterizations of sources and transformed attributed opinions into general statements. These patterns reveal a fundamental epistemological mismatch: While journalism requires explicit sourcing for every claim, LLMs generate authoritative-sounding text regardless of evidentiary support. We propose journalism-specific extensions to existing hallucination taxonomies and argue that effective newsroom tools need architectures that enforce accurate attribution rather than optimize for fluency.
- Europe (0.94)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
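A minimal sketch of how the sentence-level rates reported above could be tallied. The tool names match the abstract, but the annotation records and field names are hypothetical placeholders for the paper's actual annotation scheme.

```python
from collections import defaultdict

# Hypothetical sentence-level annotations: each record marks whether one
# sentence from a model output was labeled as a hallucination.
annotations = [
    {"tool": "ChatGPT", "output_id": 1, "hallucinated": True},
    {"tool": "ChatGPT", "output_id": 2, "hallucinated": False},
    {"tool": "Gemini", "output_id": 3, "hallucinated": True},
    {"tool": "NotebookLM", "output_id": 4, "hallucinated": False},
]

def hallucination_rates(annotations):
    """Share of outputs per tool containing at least one hallucinated sentence."""
    outputs = defaultdict(set)   # tool -> all output ids seen
    flagged = defaultdict(set)   # tool -> output ids with >= 1 hallucination
    for a in annotations:
        outputs[a["tool"]].add(a["output_id"])
        if a["hallucinated"]:
            flagged[a["tool"]].add(a["output_id"])
    return {t: len(flagged[t]) / len(outputs[t]) for t in outputs}

print(hallucination_rates(annotations))
```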
Purposefully Induced Psychosis (PIP): Embracing Hallucination as Imagination in Large Language Models
Pilcher, Kris, Tütüncü, Esen K.
Hallucinations in Large Language Models (LLMs) are widely regarded as errors - outputs that deviate from factual accuracy. However, in creative or exploratory contexts, these "mistakes" may represent unexpected avenues for innovation. We introduce Purposefully Induced Psychosis (PIP), a novel approach that amplifies LLM hallucinations for imaginative tasks such as speculative fiction, interactive storytelling, and mixed-reality simulations. Drawing on Herman Melville's Moby-Dick, where Pip's "madness" reveals profound insight, we reframe hallucinations as a source of computational imagination rather than a flaw. Our method fine-tunes LLMs to encourage speculative, metaphorical, and surreal outputs - hallucinations that are useful when factual accuracy is not the chief objective. Inspired by the consensual illusions of theater and stage magic, PIP situates these creative missteps in contexts where users willingly suspend disbelief, thereby transforming "errors" into catalysts for new ways of thinking. We discuss potential applications, design principles for ensuring user consent, preliminary observations, and implications for broader AI ethics and human-AI collaboration.
- Europe (0.28)
- North America > United States > Massachusetts (0.14)
OAEI-LLM-T: A TBox Benchmark Dataset for Understanding LLM Hallucinations in Ontology Matching Systems
Hallucinations are inevitable in downstream tasks using large language models (LLMs), and addressing them has become a substantial challenge for LLM-based ontology matching (OM) systems. We introduce a new benchmark dataset called OAEI-LLM-T. The dataset evolves from the TBox (i.e., schema-matching) datasets of the Ontology Alignment Evaluation Initiative (OAEI) and captures the hallucinations of different LLMs performing OM tasks. These OM-specific hallucinations are carefully classified into two primary categories and six sub-categories. We showcase the usefulness of the dataset for constructing an LLM leaderboard and for fine-tuning foundational LLMs for LLM-based OM systems.
- Europe > Greece > Attica > Athens (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
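A minimal sketch of how OM-specific hallucinations might be labeled by comparing an LLM's proposed correspondences against an OAEI reference alignment. The two top-level labels below are assumed for illustration; they are not the dataset's actual category names, which the abstract does not enumerate.

```python
# Hypothetical correspondences: (source_entity, target_entity) pairs.
reference = {("src:Person", "tgt:Human"), ("src:Paper", "tgt:Article")}
predicted = {("src:Person", "tgt:Human"), ("src:Paper", "tgt:Review")}

def label_correspondences(predicted, reference):
    """Split LLM-proposed mappings into correct ones and candidate hallucinations."""
    labels = {}
    ref_sources = {s for s, _ in reference}
    for pair in predicted:
        if pair in reference:
            labels[pair] = "correct"
        elif pair[0] in ref_sources:
            # The source entity has a reference match, but the target is wrong.
            labels[pair] = "hallucinated: erroneous mapping"   # assumed label
        else:
            labels[pair] = "hallucinated: fabricated mapping"  # assumed label
    return labels

print(label_correspondences(predicted, reference))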
R$^2$: A LLM Based Novel-to-Screenplay Generation Framework with Causal Plot Graphs
Lin, Zefeng, Xiao, Yi, Mo, Zhiqiang, Zhang, Qifan, Wang, Jie, Chen, Jiayang, Zhang, Jiajing, Zhang, Hui, Liu, Zhengyi, Fang, Xianyong, Xu, Xiaohua
Published as a conference paper at ICLR 2025. Automatically adapting novels into screenplays is important for the TV, film, and opera industries to promote products at low cost. The strong performance of large language models (LLMs) in long-text generation motivates us to propose an LLM-based framework, Reader-Rewriter (R²), for this task. However, there are two fundamental challenges. First, LLM hallucinations may cause inconsistent plot extraction and screenplay generation. Second, causality-embedded plot lines must be effectively extracted for coherent rewriting. We therefore propose two corresponding tactics: 1) a hallucination-aware refinement method (HAR) to iteratively discover and eliminate the effects of hallucinations, and 2) a causal plot-graph construction method (CPC) based on a greedy cycle-breaking algorithm to efficiently construct plot lines with event causalities. Using these techniques, R² employs two modules to mimic the human screenplay rewriting process: the Reader module adopts a sliding window and CPC to build the causal plot graphs, while the Rewriter module first generates scene outlines based on the graphs and then the screenplays. HAR is integrated into both modules for accurate LLM inference.
- Asia > China (0.24)
- North America > United States > Colorado > Weld County > Evans (0.04)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)
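A minimal sketch of a greedy cycle-breaking step like the one CPC relies on, under the assumption that plot events form a weighted directed graph and that the weakest causal edge in each detected cycle is dropped until the graph is acyclic. The weighting scheme and cycle-detection details of the actual method are not specified in the abstract.

```python
def find_cycle(graph):
    """Return the edges of one directed cycle, or None (recursive DFS)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: WHITE for u in graph}
    parent = {}

    def dfs(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY:  # back edge closes a cycle v -> ... -> u -> v
                cycle, w = [(u, v)], u
                while w != v:
                    cycle.append((parent[w], w))
                    w = parent[w]
                return cycle
            if color[v] == WHITE:
                parent[v] = u
                found = dfs(v)
                if found:
                    return found
        color[u] = BLACK
        return None

    for u in graph:
        if color[u] == WHITE:
            found = dfs(u)
            if found:
                return found
    return None

def break_cycles(edges, weight):
    """Greedily remove the lowest-weight edge of each cycle until acyclic."""
    edges = set(edges)
    while True:
        graph = {}
        for u, v in edges:
            graph.setdefault(u, []).append(v)
            graph.setdefault(v, [])
        cycle = find_cycle(graph)
        if cycle is None:
            return edges
        edges.remove(min(cycle, key=lambda e: weight[e]))

# Hypothetical plot events with causal-strength weights.
w = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "a"): 0.2, ("c", "d"): 0.7}
print(break_cycles(w.keys(), w))  # drops the weak ("c", "a") edge
```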
OAEI-LLM: A Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching
Qiang, Zhangcheng, Taylor, Kerry, Wang, Weiqing, Jiang, Jing
Hallucinations of large language models (LLMs) commonly occur in domain-specific downstream tasks, and ontology matching (OM) is no exception. The prevalence of LLMs in OM raises the need for benchmarks to better understand LLM hallucinations. The OAEI-LLM dataset is an extended version of the Ontology Alignment Evaluation Initiative (OAEI) datasets that evaluates LLM-specific hallucinations in OM tasks. We outline the methodology used in dataset construction and schema extension, and provide examples of potential use cases.
- Oceania > Australia > Victoria > Melbourne (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Switzerland (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)
MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models
Agarwal, Vibhor, Jin, Yiqiao, Chandra, Mohit, De Choudhury, Munmun, Kumar, Srijan, Sastry, Nishanth
The remarkable capabilities of large language models (LLMs) in language understanding and generation have not rendered them immune to hallucinations: LLMs can still generate plausible-sounding but factually incorrect or fabricated information. As LLM-empowered chatbots become popular, laypeople may frequently ask health-related queries and risk falling victim to these LLM hallucinations, with various societal and healthcare implications. In this work, we conduct a pioneering study of hallucinations in LLM-generated responses to real-world healthcare queries from patients. We propose MedHalu, a carefully crafted, first-of-its-kind medical hallucination dataset covering a diverse range of health-related topics, with the corresponding hallucinated LLM responses labeled by hallucination type and hallucinated text span. We also introduce the MedHaluDetect framework to evaluate the capabilities of various LLMs in detecting hallucinations, and we employ three groups of evaluators -- medical experts, LLMs, and laypeople -- to study who is most vulnerable to these medical hallucinations. We find that LLMs detect hallucinations much less reliably than medical experts; they perform no better than laypeople, and in some cases worse. To close this gap, we propose an expert-in-the-loop approach that improves hallucination detection by LLMs through infusing expert reasoning. We observe significant performance gains for all the LLMs, with an average macro-F1 improvement of 6.3 percentage points for GPT-4.
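A minimal sketch of the expert-in-the-loop idea described above: expert reasoning is injected into the detection prompt before the judgment is requested. The `call_llm` helper and the prompt wording are assumptions for illustration, not the paper's actual implementation.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for any chat-completion client (not a real API).
    return "yes - the dosage claim is unsupported (canned reply for this sketch)"

def detect_hallucination(query: str, response: str,
                         expert_reasoning: str | None = None) -> str:
    """Ask an LLM whether a healthcare answer is hallucinated, optionally
    infusing expert reasoning into the prompt (the expert-in-the-loop step)."""
    prompt = (
        "You are verifying a healthcare answer.\n"
        f"Question: {query}\n"
        f"Answer: {response}\n"
    )
    if expert_reasoning:
        prompt += f"An expert notes: {expert_reasoning}\n"
    prompt += "Does the answer contain a hallucination? Answer yes or no, with a reason."
    return call_llm(prompt)

print(detect_hallucination(
    "Can I take 8 g of paracetamol a day?",
    "Yes, 8 g daily is a standard adult dose.",
    expert_reasoning="Adult dosing guidance caps paracetamol at 4 g/day.",
))
```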
Code Hallucination
Rahman, Mirza Masfiqur, Kundu, Ashish
Generative models such as large language models are extensively used as code copilots and for whole-program generation. However, the programs they generate often have questionable correctness, authenticity, and reliability: they may not follow the user's requirements, may produce incorrect or nonsensical outputs, or may even contain semantic or syntactic errors - collectively known as LLM hallucination. In this work, we present several types of code hallucination, which we generated manually using large language models. We also present HallTrigger, a technique that demonstrates efficient ways of generating arbitrary code hallucination. Our method leverages three dynamic attributes of LLMs to craft prompts that successfully trigger hallucinations without requiring access to model architecture or parameters. Results on popular black-box models suggest that HallTrigger is indeed effective and that pervasive LLM hallucination has a substantial impact on software development.
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > California > Santa Clara County > San Jose (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
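This is not HallTrigger itself (the abstract does not detail its prompt-crafting attributes); it is a minimal sketch of a complementary black-box check: statically extracting the modules a generated Python snippet imports and flagging those that do not exist in the environment, one common symptom of code hallucination.

```python
import ast
import importlib.util

def missing_imports(generated_code: str) -> list[str]:
    """Return top-level modules imported by the snippet that cannot be found."""
    tree = ast.parse(generated_code)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)

# An LLM-generated snippet importing a plausible-sounding but fake package.
snippet = "import json\nimport fastmathx\nprint(json.dumps({'ok': True}))"
print(missing_imports(snippet))  # ['fastmathx'], assuming no such package exists
```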
Leveraging Graph Structures to Detect Hallucinations in Large Language Models
Nonkes, Noa, Agaronian, Sergei, Kanoulas, Evangelos, Petcu, Roxana
Large language models are extensively applied across a wide range of tasks, such as customer support, content creation, educational tutoring, and financial guidance. However, a well-known drawback is their predisposition to generate hallucinations, which damages the trustworthiness of the information these models provide and affects decision-making and user confidence. We propose a method to detect hallucinations by examining the structure of the latent space and finding associations within hallucinated and non-hallucinated generations. We create a graph structure that connects generations lying close together in the embedding space, and we employ a Graph Attention Network that uses message passing to aggregate information from neighboring nodes, assigning each neighbor a degree of importance based on its relevance. Our findings show that 1) there exists a structure in the latent space that differentiates between hallucinated and non-hallucinated generations, 2) Graph Attention Networks can learn this structure and generalize it to unseen generations, and 3) the robustness of our method is enhanced by incorporating contrastive learning. When evaluated against evidence-based benchmarks, our model performs comparably without access to search-based methods.
- Asia > Singapore (0.04)
- North America > United States (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
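A minimal sketch of the architecture the abstract describes, assuming PyTorch Geometric: generations are embedded, connected to their nearest neighbors in embedding space, and classified by a two-layer Graph Attention Network. The dimensions, neighbor count, and random embeddings are placeholders; the paper's actual hyperparameters and embedding source are not given in the abstract.

```python
import torch
from torch_geometric.nn import GATConv

def knn_edge_index(x: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Connect each generation's embedding to its k nearest neighbors."""
    dist = torch.cdist(x, x)
    dist.fill_diagonal_(float("inf"))              # exclude self-loops
    nbrs = dist.topk(k, largest=False).indices     # [N, k]
    src = torch.arange(x.size(0)).repeat_interleave(k)
    # Edges point from neighbors to the node that aggregates their messages.
    return torch.stack([nbrs.reshape(-1), src])    # [2, N*k]

class HallucinationGAT(torch.nn.Module):
    """Two-layer GAT scoring each node as hallucinated vs. not."""
    def __init__(self, dim: int, hidden: int = 64, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(dim, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, 2, heads=1)

    def forward(self, x, edge_index):
        h = torch.relu(self.gat1(x, edge_index))
        return self.gat2(h, edge_index)            # per-node logits

# Placeholder embeddings for 100 generations of dimension 384.
x = torch.randn(100, 384)
edge_index = knn_edge_index(x, k=5)
logits = HallucinationGAT(dim=384)(x, edge_index)
print(logits.shape)  # torch.Size([100, 2])
```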
HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild
Zhu, Zhiying, Yang, Yiming, Sun, Zhiqing
Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging user queries (adversarially filtered by Alpaca) from existing real-world user-LLM interaction datasets, including ShareGPT, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, enabling a fine-grained analysis of the kinds of hallucinations LLMs exhibit, and we synthesize reference answers with the powerful GPT-4 model and retrieval-augmented generation (RAG). Our benchmark offers a novel approach to understanding and improving LLM reliability in scenarios reflective of real-world interactions. It is available at https://github.com/Dianezzy/HaluEval-Wild.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)
- Europe > Italy (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
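A minimal sketch of the reference-answer synthesis step described above, in which retrieved passages ground the answer a strong model produces. The toy lexical retriever and the prompt wording are assumptions for illustration, not the benchmark's actual pipeline.

```python
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Toy lexical retriever: rank passages by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def synthesize_reference(query: str, corpus: list[str]) -> str:
    """Build a grounded prompt for the answer-synthesis model."""
    passages = retrieve(query, corpus)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the user query using only the passages below; "
        "say 'unknown' if they are insufficient.\n"
        f"{context}\nQuery: {query}\nAnswer:"
    )  # this prompt would then be sent to a strong model such as GPT-4

corpus = ["Paris is the capital of France.", "The Nile flows through Egypt."]
print(synthesize_reference("What is the capital of France?", corpus))
```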